Estimating the Performance Impact of the MCDRAM on KNL Using Dual-Socket Ivy Bridge Nodes on Cray XC30
نویسندگان
چکیده
NERSC is preparing for its next petascale system, named Cori, a Cray XC system based on the Intel KNL MIC architecture. Each Cori node will have 72 cores (288 threads), 512 bit vector units, and a low capacity (16GB) and high bandwidth (~5x DDR4) on-package memory (MCDRAM or HBM). To help applications get ready for Cori, NERSC has developed optimization strategies that focus on the MPI+OpenMP program model, vectorization, and the HBM. While the optimization on MPI+OpenMP and vectorization can be carried out on today’s multi-core architectures, optimization of the HBM is difficult to perform where the HBM is unavailable. In this paper, we will present our HBM performance analysis on the VASP code, a widely used materials science code, using Intel's development tools, Memkind and AutoHBW, and a dual-socket Ivy Bridge processor node on Edison, a Cray XC30, as a proxy to the HBM on KNL. Keywords-HBM; MCDRAM; KNL; VASP; memory bandwidth; Memkind; AutoHBW; performance
منابع مشابه
CP2K Performance from Cray XT3 to XC30
CP2K is a powerful open-source program for atomistic simulation using a range of methods including Classical potentials, Density Functional Theory based on the Gaussian and Plane Waves approach, and post-DFT methods. CP2K has been designed and optimised for large parallel HPC systems, including a mixed-mode MPI/OpenMP parallelisation, as well as CUDA kernels for particular types of calculations...
متن کاملPerformance of Hybrid MPI/OpenMP VASP on Cray XC40 Based on Intel Knights Landing Many Integrated Core Architecture
With the recent installation of Cori, a Cray XC40 system with Intel Xeon Phi Knights Landing (KNL) many integrated core (MIC) architecture, NERSC is transitioning from the multi-core to the more energy-efficient many-core era. The developers of VASP, a widely used materials science code, have adopted MPI/OpenMP parallelism to better exploit the increased on-node parallelism, wider vector units,...
متن کاملAnalysis of Cray XC30 Performance Using Trinity-NERSC-8 Benchmarks and Comparison with Cray XE6 and IBM BG/Q
In this paper, we examine the performance of a suite of applications on three different architectures: Edison, a Cray XC30 with Intel Ivy Bridge processors; Hopper and Cielo, both Cray XE6’s with AMD Magny–Cours processors; and Mira, an IBM BlueGene/Q with PowerPC A2 processors. The applications chosen are a subset of the applications used in a joint procurement effort between Lawrence Berkeley...
متن کاملOptimizing Cray MPI and SHMEM Software Stacks for Cray-XC Supercomputers based on Intel KNL Processors
HPC applications commonly use Message Passing Interface (MPI) and SHMEM programming models to achieve high performance in a portable manner. With the advent of the Intel MIC processor technology, hybrid programming models that involve the use of MPI/SHMEM along with threading models (such as OpenMP) are gaining traction. However, most current generation MPI implementations are not poised to off...
متن کاملPorting of the DBCSR Library for Sparse Matrix-Matrix Multiplications to Intel Xeon Phi Systems
Multiplication of two sparse matrices is a key operation in the simulation of the electronic structure of systems containing thousands of atoms and electrons. The highly optimized sparse linear algebra library DBCSR (Distributed Block Compressed Sparse Row) has been specifically designed to efficiently perform such sparse matrix-matrix multiplications. This library is the basic building block f...
متن کامل